Fix/pdf conversion and institution mappings by HanssonMagnus · Pull Request #3 · HanssonMagnus/bis-scraper

HanssonMagnus · 2025-11-09T12:32:02Z

No description provided.

- Fix PDF conversion to handle None/string/bytes returns from textract
- Improve error handling (remove unnecessary exception re-raising)
- Add Central Bank of Eswatini to institutions list
- Add institution aliases: Banco de Portugal, Saudi Central Bank
- Update test to reflect improved error handling
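To illustrate the first item, here is a minimal sketch of how a conversion result that may be `None`, `str`, or `bytes` could be normalized to text. The helper name `normalize_extracted_text`, the UTF-8 decoding, and the choice to return an empty string for `None` are assumptions made for illustration; the actual code in this PR is not shown here and may differ.

```python
from typing import Union


def normalize_extracted_text(raw: Union[bytes, str, None]) -> str:
    """Coerce a PDF-extraction result to str (hypothetical helper)."""
    if raw is None:
        # Assumption: treat a missing result as empty text rather than an error.
        return ""
    if isinstance(raw, bytes):
        # Assumption: UTF-8 with replacement, so bad bytes do not raise.
        return raw.decode("utf-8", errors="replace")
    if isinstance(raw, str):
        return raw
    raise TypeError(f"Unexpected extraction result type: {type(raw).__name__}")
```

A caller would pass whatever the extraction step returned through this function before writing the text file, so downstream code only ever sees `str`.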

- Add Known Limitations section to README
- Document PDF encoding issues in API docs
- Explain that ~8% of PDFs may fail conversion due to source PDF encoding issues
- Clarify this is a PDF quality issue, not a code bug

- Add recategorize_unknown_files() function to re-categorize files from unknown folder
- Automatically called after scraping completes
- Moves both PDFs and corresponding text files together
- Uses updated institution mappings to move files to correct folders
- Cleans up empty unknown folders for both PDFs and texts
- Makes package more robust when institution mappings are updated
- Add comprehensive test coverage (8 tests) for all scenarios
- Handle edge cases: missing PDFs, missing text files, partial re-categorization
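The behavior described above could be sketched roughly as follows. This is not the PR's implementation: the function name `recategorize_unknown_files_sketch`, its parameters, the `unknown` subfolder layout, the `.pdf`/`.txt` pairing by file stem, and the caller-supplied `lookup` callable are all assumptions chosen for illustration.

```python
import shutil
from pathlib import Path
from typing import Callable, Optional


def recategorize_unknown_files_sketch(
    pdf_root: Path,
    text_root: Path,
    lookup: Callable[[str], Optional[str]],
) -> int:
    """Move files out of <root>/unknown once `lookup` can name an institution.

    `lookup` maps a file stem to an institution folder name, or None if the
    institution is still unknown (hypothetical interface).
    """
    roots = ((pdf_root, ".pdf"), (text_root, ".txt"))

    # Collect stems from both trees so a lone PDF or lone text file is handled.
    stems = set()
    for root, suffix in roots:
        unknown = root / "unknown"
        if unknown.is_dir():
            stems.update(p.stem for p in unknown.glob(f"*{suffix}"))

    moved = 0
    for stem in sorted(stems):
        institution = lookup(stem)
        if institution is None:
            continue  # still unmapped: leave it in place (partial re-categorization)
        for root, suffix in roots:
            src = root / "unknown" / f"{stem}{suffix}"
            if src.exists():
                dest_dir = root / institution
                dest_dir.mkdir(parents=True, exist_ok=True)
                shutil.move(str(src), str(dest_dir / src.name))
        moved += 1

    # Remove the unknown folders only if they ended up empty.
    for root, _ in roots:
        unknown = root / "unknown"
        if unknown.is_dir() and not any(unknown.iterdir()):
            unknown.rmdir()
    return moved
```

Pairing by stem keeps a PDF and its text file together, and files whose institution is still unmapped stay in the unknown folder until a later mapping update.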

… adjust the start date for scraping from 2025-10-01 to 2025-08-01.

HanssonMagnus added 4 commits November 9, 2025 12:59

Document PDF text conversion limitation

aa064d4


Update run_full_scrape.sh to change data and log directory paths, and…

ab54017


HanssonMagnus merged commit 55d4d06 into main Nov 9, 2025
3 checks passed

HanssonMagnus deleted the fix/pdf-conversion-and-institution-mappings branch November 9, 2025 12:33

HanssonMagnus merged 4 commits into main from fix/pdf-conversion-and-institution-mappings
